90 research outputs found

    RELcat: a Relation Registry for ISOcat data categories

    The ISOcat Data Category Registry is essentially a flat, easily extensible list of data category specifications. To foster reuse and standardization, only very shallow relationships among data categories are stored in the registry. However, to assist crosswalks between various (application) domains, possibly based on personal views, and to counter a possible proliferation of data categories, more types of ontological relationships need to be specified. RELcat is a first prototype of a Relation Registry, which allows arbitrary relationships to be stored. These relationships can reflect the personal view of one linguist or of a larger community. The basis of the registry is a relation type taxonomy that can easily be extended. On the one hand, this allows existing sets of relations, specified in, for example, an OWL (2) ontology or a SKOS taxonomy, to be loaded. On the other hand, it allows algorithms that query the registry and traverse the stored semantic network to remain ignorant of the original source vocabulary. This paper describes first experiences with RELcat and explains some initial design decisions.
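
    The role of the relation type taxonomy can be illustrated with a minimal sketch. It is not RELcat's actual data model or API: the `rel:` type names, the mapping of OWL and SKOS predicates onto them, and the `dc:` identifiers are all hypothetical, chosen only to show how a traversal can ask for a generic relation type without knowing which source vocabulary a relation came from.

```python
# Hypothetical relation type taxonomy: each type points to its parent type.
# Vocabulary-specific predicates hang under generic "rel:" types (assumed names).
TAXONOMY = {
    "rel:sameAs": "rel:related",
    "rel:subClassOf": "rel:related",
    "owl:sameAs": "rel:sameAs",
    "skos:exactMatch": "rel:sameAs",
    "rdfs:subClassOf": "rel:subClassOf",
    "skos:broader": "rel:subClassOf",  # rough simplification, for illustration
}

# Stored relations as (subject, relation type, object); identifiers are made up.
RELATIONS = [
    ("dc:A", "owl:sameAs", "dc:B"),
    ("dc:B", "skos:exactMatch", "dc:C"),
    ("dc:C", "skos:broader", "dc:D"),
]


def is_a(rel_type, ancestor):
    """Return True if rel_type equals ancestor or descends from it."""
    while rel_type is not None:
        if rel_type == ancestor:
            return True
        rel_type = TAXONOMY.get(rel_type)
    return False


def equivalents(start, relations):
    """Collect everything reachable from start via any kind of rel:sameAs.

    The relation is treated as symmetric, so edges are followed both ways.
    """
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for subj, rel, obj in relations:
            if not is_a(rel, "rel:sameAs"):
                continue
            for a, b in ((subj, obj), (obj, subj)):
                if a == node and b not in seen:
                    seen.add(b)
                    stack.append(b)
    return seen - {start}
```

    Here `equivalents("dc:A", RELATIONS)` follows both the OWL and the SKOS relation because both sit under the same generic type, while the `skos:broader` link to `dc:D` is ignored.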

    Towards standardized descriptions of linguistic features: ISOcat and procedures for using common data categories

    Automatic Language Identification of written texts is a well-established area of research in Computational Linguistics. State-of-the-art algorithms often rely on character n-gram models to identify the language of a text, with good results reported for European languages. In this paper we propose the use of a character n-gram model and a word n-gram language model for the automatic classification of two written varieties of Portuguese: European and Brazilian. Accuracy reached 0.998 using character 4-grams.
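
    As a rough illustration of character n-gram classification, the sketch below scores a text against per-variety 4-gram counts with add-one smoothing and picks the most likely label. It is not the method evaluated in the paper: the function names, the smoothing choice, and the two toy training sentences are assumptions made for this example, and a model trained on so little data says nothing about real accuracy.

```python
import math
from collections import Counter


def char_ngrams(text, n=4):
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(samples, n=4):
    """Count n-grams per label; samples maps a label to a list of texts."""
    counts = {label: Counter() for label in samples}
    for label, texts in samples.items():
        for text in texts:
            counts[label].update(char_ngrams(text.lower(), n))
    vocab = set().union(*counts.values())
    return counts, vocab, n


def classify(text, model):
    """Return the label with the highest add-one-smoothed log-likelihood."""
    counts, vocab, n = model
    grams = char_ngrams(text.lower(), n)
    best_label, best_score = None, -math.inf
    for label, counter in counts.items():
        total = sum(counter.values())
        score = sum(
            math.log((counter[g] + 1) / (total + len(vocab))) for g in grams
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (made up for this sketch), built on well-known lexical
# differences such as autocarro/ônibus, equipa/time and facto/fato.
TOY_SAMPLES = {
    "European": ["o autocarro da equipa chegou de facto"],
    "Brazilian": ["o ônibus do time chegou de fato"],
}
MODEL = train(TOY_SAMPLES)
```

    With this toy model, `classify("a equipa perdeu o autocarro", MODEL)` selects the European label because most of its 4-grams occur only in the European sample.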

    Linking to linguistic data categories in ISOcat

    ISO Technical Committee 37, Terminology and other language and content resources, established an ISO 12620:2009 based Data Category Registry (DCR), called ISOcat (see http://www.isocat.org), to foster semantic interoperability of linguistic resources. However, this goal can only be met if the data categories are reused by a wide variety of linguistic resource types. A resource indicates its usage of data categories by linking to them. The small DC Reference XML vocabulary is used to embed links to data categories in XML documents. The link is established by a URI, which serves as the Persistent IDentifier (PID) of a data category. This paper discusses the efforts to mimic the same approach for RDF-based resources. It also introduces the RDF quad store based Relation Registry RELcat, which enables ontological relationships between data categories that are not supported by ISOcat and thus adds an extra level of linguistic knowledge.
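
    The idea of embedding data category links in an XML document can be sketched as follows. The sketch assumes that the DC Reference vocabulary is a `datcat` attribute in the namespace `http://www.isocat.org/ns/dcr`; the sample document, its element names, and the `DC-0001`/`DC-0002` identifiers are placeholders invented for this example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the DC Reference vocabulary.
DCR_NS = "http://www.isocat.org/ns/dcr"

# Made-up sample document; the PIDs below are placeholders, not real entries.
SAMPLE = """<lexicon xmlns:dcr="http://www.isocat.org/ns/dcr">
  <feature name="partOfSpeech" dcr:datcat="http://www.isocat.org/datcat/DC-0001"/>
  <feature name="gender" dcr:datcat="http://www.isocat.org/datcat/DC-0002"/>
  <feature name="note"/>
</lexicon>"""


def data_category_links(xml_text):
    """Return (name, PID URI) pairs for elements that link to a data category."""
    root = ET.fromstring(xml_text)
    key = "{%s}datcat" % DCR_NS  # ElementTree spells qualified names as {ns}local
    return [
        (element.get("name"), element.attrib[key])
        for element in root.iter()
        if key in element.attrib
    ]
```

    Calling `data_category_links(SAMPLE)` yields the two linked features and skips the `note` feature, which carries no data category reference.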

    FLAT: A CLARIN-compatible repository solution based on Fedora Commons

    This paper describes the development of a CLARIN-compatible repository solution that fulfils both the long-term preservation requirements and the current-day discoverability and usability needs of an online data repository of language resources. The widely used Fedora Commons open source repository framework, combined with the Islandora discovery layer, forms the basis of the solution. On top of this existing solution, additional modules and tools are developed to make it suitable for the types of data and metadata that are used by the participating partners.

    Knowledge management for small languages

    This paper gives an overview of the knowledge components needed for extensive documentation of small languages. The Language Archive is striving to offer all these tools to the linguistic community. The major tools are described in relation to the knowledge components, followed by a discussion of what is currently lacking and possible strategies to move forward.

    ISOcat: Remodeling metadata for language resources

    The Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands, is creating a state-of-the-art web environment for the ISO TC 37 (terminology and other language and content resources) metadata registry. This Data Category Registry (DCR) is called ISOcat and encompasses data categories for a broad range of language resources. Under the governance of the DCR Board, ISOcat provides an open work space for creating data category specifications, defining Data Category Selections (DCSs) (domain-specific groups of data categories), and standardising selected data categories and DCSs. Designers visualise future interactivity among the DCR, reference registries and ontological knowledge space.

    Content-based video indexing for the support of digital library search

    This paper presents a digital library search engine that combines efforts of the AMIS and DMW research projects, each covering significant parts of the problem of finding the required information in an enormous mass of data. The most important contributions of our work are the following: (1) We demonstrate a flexible solution for the extraction and querying of meta-data from multimedia documents in general. (2) Scalability and efficiency support are illustrated for full-text indexing and retrieval. (3) We show how, for a more limited domain, like an intranet, conceptual modelling can offer additional and more powerful query facilities. (4) In the limited domain case, we demonstrate how domain knowledge can be used to interpret low-level features into semantic content. In this short description, we focus on the first and fourth items.

    Ensuring semantic interoperability on lexical resources

    In this paper, we describe a unifying approach to tackle data heterogeneity issues for lexica and related resources. We present LEXUS, our software that implements the Lexical Markup Framework (LMF) to uniformly describe and manage lexica of different structures. LEXUS also makes use of a central Data Category Registry (DCR) to address terminological issues with regard to linguistic concepts as well as the handling of working and object languages. Finally, we report on ViCoS, a LEXUS extension, providing support for the definition of arbitrary semantic relations between lexical entries or parts thereof.